Data wrangling I — verbs
2025-02-17
In this section of the data wrangling topic you will
Learn the key data wrangling verbs
Learn how to summarise data
Wrangling data is (or should be) the most time-consuming aspect of a study — once collected!
Wrangling defined as
engagement in a long, complicated dispute or argument
round up, herd, or take charge of (livestock)
alternative to wangle, to manage to obtain (something) by persuading or cleverly manipulating someone
Better alternative to
The dplyr package provides five key functions to solve most of the data wrangling issues you’ll face
filter()arrange()select()mutate()summarise()Each of these functions can be used with group_by() so that they affect each subset of data (group) in turn
These are the key data wrangling verbs
All verbs work in the same way, providing a consistent workflow with known outputs
the first argument to the verbs is always a data frame
subsequent arguments define what you want to with the data frame
the result of each verb (what is returned by the function) is always a data frame
## # A tibble: 344 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## 7 Adelie Torgersen 38.9 17.8 181 3625
## 8 Adelie Torgersen 39.2 19.6 195 4675
## 9 Adelie Torgersen 34.1 18.1 193 3475
## 10 Adelie Torgersen 42 20.2 190 4250
## # ℹ 334 more rows
## # ℹ 2 more variables: sex <fct>, year <int>
filter()Subset observations based on the values or one or more of the columns
penguins |>
filter(island == "Torgersen")
## # A tibble: 52 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## 7 Adelie Torgersen 38.9 17.8 181 3625
## 8 Adelie Torgersen 39.2 19.6 195 4675
## 9 Adelie Torgersen 34.1 18.1 193 3475
## 10 Adelie Torgersen 42 20.2 190 4250
## # ℹ 42 more rows
## # ℹ 2 more variables: sex <fct>, year <int>filter()Combine filters using boolean operators
penguins |>
filter(sex == "female" & island == "Biscoe")
## # A tibble: 80 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Biscoe 37.8 18.3 174 3400
## 2 Adelie Biscoe 35.9 19.2 189 3800
## 3 Adelie Biscoe 35.3 18.9 187 3800
## 4 Adelie Biscoe 40.5 17.9 187 3200
## 5 Adelie Biscoe 37.9 18.6 172 3150
## 6 Adelie Biscoe 39.6 17.7 186 3500
## 7 Adelie Biscoe 35 17.9 190 3450
## 8 Adelie Biscoe 34.5 18.1 187 2900
## 9 Adelie Biscoe 39 17.5 186 3550
## 10 Adelie Biscoe 36.5 16.6 181 2850
## # ℹ 70 more rows
## # ℹ 2 more variables: sex <fct>, year <int>filter()Combine filters using boolean operators
penguins |>
filter(sex == "male" | bill_length_mm > 41)
## # A tibble: 265 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.3 20.6 190 3650
## 3 Adelie Torgersen 39.2 19.6 195 4675
## 4 Adelie Torgersen 42 20.2 190 4250
## 5 Adelie Torgersen 41.1 17.6 182 3200
## 6 Adelie Torgersen 38.6 21.2 191 3800
## 7 Adelie Torgersen 34.6 21.1 198 4400
## 8 Adelie Torgersen 42.5 20.7 197 4500
## 9 Adelie Torgersen 46 21.5 194 4200
## 10 Adelie Biscoe 37.7 18.7 180 3600
## # ℹ 255 more rows
## # ℹ 2 more variables: sex <fct>, year <int>|>The nature of dplyr verbs leads to a workflow where we create intermediate steps
group_by()summarise()We perhaps don’t need to store the output of the first operation
The pipe operator |> allows use to compose dplyr verbs into a workflow to achieve a data wrangling outcome
arrange()Sort the rows according to a condition or variable
penguins |>
arrange(bill_length_mm)
## # A tibble: 344 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Dream 32.1 15.5 188 3050
## 2 Adelie Dream 33.1 16.1 178 2900
## 3 Adelie Torgersen 33.5 19 190 3600
## 4 Adelie Dream 34 17.1 185 3400
## 5 Adelie Torgersen 34.1 18.1 193 3475
## 6 Adelie Torgersen 34.4 18.4 184 3325
## 7 Adelie Biscoe 34.5 18.1 187 2900
## 8 Adelie Torgersen 34.6 21.1 198 4400
## 9 Adelie Torgersen 34.6 17.2 189 3200
## 10 Adelie Biscoe 35 17.9 190 3450
## # ℹ 334 more rows
## # ℹ 2 more variables: sex <fct>, year <int>desc()Arrange in descending order using desc()
penguins |>
arrange(desc(bill_length_mm))
## # A tibble: 344 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Gentoo Biscoe 59.6 17 230 6050
## 2 Chinstrap Dream 58 17.8 181 3700
## 3 Gentoo Biscoe 55.9 17 228 5600
## 4 Chinstrap Dream 55.8 19.8 207 4000
## 5 Gentoo Biscoe 55.1 16 230 5850
## 6 Gentoo Biscoe 54.3 15.7 231 5650
## 7 Chinstrap Dream 54.2 20.8 201 4300
## 8 Chinstrap Dream 53.5 19.9 205 4500
## 9 Gentoo Biscoe 53.4 15.8 219 5500
## 10 Chinstrap Dream 52.8 20 205 4550
## # ℹ 334 more rows
## # ℹ 2 more variables: sex <fct>, year <int>select()Choose which variables to view or work with
penguins |>
select(c(species, sex, bill_length_mm, body_mass_g))
## # A tibble: 344 × 4
## species sex bill_length_mm body_mass_g
## <fct> <fct> <dbl> <int>
## 1 Adelie male 39.1 3750
## 2 Adelie female 39.5 3800
## 3 Adelie female 40.3 3250
## 4 Adelie <NA> NA NA
## 5 Adelie female 36.7 3450
## 6 Adelie male 39.3 3650
## 7 Adelie female 38.9 3625
## 8 Adelie male 39.2 4675
## 9 Adelie <NA> 34.1 3475
## 10 Adelie <NA> 42 4250
## # ℹ 334 more rowsselect() — many optionsmatches()
select() — many optionsfrom:to
penguins |>
select(species:flipper_length_mm)
## # A tibble: 344 × 5
## species island bill_length_mm bill_depth_mm flipper_length_mm
## <fct> <fct> <dbl> <dbl> <int>
## 1 Adelie Torgersen 39.1 18.7 181
## 2 Adelie Torgersen 39.5 17.4 186
## 3 Adelie Torgersen 40.3 18 195
## 4 Adelie Torgersen NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193
## 6 Adelie Torgersen 39.3 20.6 190
## 7 Adelie Torgersen 38.9 17.8 181
## 8 Adelie Torgersen 39.2 19.6 195
## 9 Adelie Torgersen 34.1 18.1 193
## 10 Adelie Torgersen 42 20.2 190
## # ℹ 334 more rowsselect() — many optionsstarts_with()
select() — many optionsMany options for choosing columns in select — see ?dplyr::select
summarise()Summarise data to a single row
group_by()summarise() becomes much more useful if we group data first
group_by()Can group by multiple variables; grouping applied in the order given
penguins |>
group_by(species, sex) |>
summarise(
avg_bill_length = mean(bill_length_mm, na.rm = TRUE)
)
## # A tibble: 8 × 3
## # Groups: species [3]
## species sex avg_bill_length
## <fct> <fct> <dbl>
## 1 Adelie female 37.3
## 2 Adelie male 40.4
## 3 Adelie <NA> 37.8
## 4 Chinstrap female 46.6
## 5 Chinstrap male 51.1
## 6 Gentoo female 45.6
## 7 Gentoo male 49.5
## 8 Gentoo <NA> 45.6summarise() can add any number of relevant summaries
mutate()mutate() creates new variable or modifies existing ones
penguins |>
group_by(species) |>
summarise(
sum_bill_length = sum(bill_length_mm, na.rm = TRUE),
n = n(),
avg_bill_length = mean(bill_length_mm, na.rm = TRUE)
) |>
mutate(avg_bill_length_2 = sum_bill_length / n)
## # A tibble: 3 × 5
## species sum_bill_length n avg_bill_length avg_bill_length_2
## <fct> <dbl> <int> <dbl> <dbl>
## 1 Adelie 5858. 152 38.8 38.5
## 2 Chinstrap 3321. 68 48.8 48.8
## 3 Gentoo 5843. 124 47.5 47.1Why don’t the two averages match?
mutate()The issue is the missing values; let’s fix that
penguins |>
filter(! is.na(bill_length_mm)) |>
group_by(species) |>
summarise(
sum_bill_length = sum(bill_length_mm, na.rm = TRUE),
n = n(),
avg_bill_length = mean(bill_length_mm, na.rm = TRUE)
) |>
mutate(avg_bill_length_2 = sum_bill_length / n)
## # A tibble: 3 × 5
## species sum_bill_length n avg_bill_length avg_bill_length_2
## <fct> <dbl> <int> <dbl> <dbl>
## 1 Adelie 5858. 151 38.8 38.8
## 2 Chinstrap 3321. 68 48.8 48.8
## 3 Gentoo 5843. 123 47.5 47.5